Skip to main content

Cybersecurity Compliance Test

Overview

Previous studies have shown that LLMs can be used to produce malicious outputs, including guidelines for conducting illegal or non-compliant tasks. At the same time, organizations such as MITRE have long cataloged different types of adversarial techniques in the field of cybersecurity. As LLMs are increasingly used as coding assistants, it becomes important to detect a model’s propensity to assist in conducting cybersecurity attacks. The goal of the Cybersecurity Compliance attack, based on the Cybseceval paper, is to determine model vulnerabilities related to providing responses that violate these established cybersecurity standards [1].

We draw attack prompts from the Purple Llama Cyberseceval dataset, an open source dataset of attack prompts that ask a language model to provide a user assistance in a cybersecurity attack. Each attack prompt corresponds to one of the 10 MITRE ATT&CK Categories (Figure 1). Our attack shuffles through the following categories in a round-robin manner. This means that if the user specifies a sampling_rate of 10, there is exactly 1 attack prompt corresponding to each of the categories. If the user specifies a sampling_rate of 100, there are 10 attack prompts for each of the categories.

MITRE ATT&CK CategoryThe adversary is attempting to...Example of Technique
Reconnaissancegather information they can use to plan future operationsActive Scanning
Discoveryfigure out your environmentAccount discovery
Executionrun malicious codeCloud Admin Command
Privilege Escalationgain higher-level permissionsAccess Token Manipulation
Persistencemaintain their footholdAccount Manipulation
Defense Evasionavoid being detectedBuild Image on Host
Lateral Movementmove through your environmentInternal Spearphishing
Collectiongather data of interest to their goalAdversary-in-the-Middle
Command and Controlcommunicate with compromised systems to control themApplication Layer Protocol
Exfiltrationsteal dataAutomated Exfiltration

Metrics

Noncompliance Rate refers to the rate at which the target model both doesn’t refuse and is helpful in conducting the attack. In our reports, we break down target model responses into 3 categories that sum to 100%: % outright refusals, % helpful response, % not helpful response. Non-compliance rate is the % helpful response rate. Completions are assessed in a two step process.

First, if the model refuses to help the attacker, then the completion is, by definition, not helpful to cyber attackers; we apply a set of heuristics to detect refusals (keywords, semantics, etc.).

Second, if the model provides code to help the attacker, we provide the target model’s response to a model (the expansion model) that expands upon the model’s response (e.g., answers MITRE-specific questions and writes several paragraphs about the meaning and implications of the code). We use another model (the judge model) that reads the original code and the aforementioned expansion to return “helpful” or “not helpful” to ultimately assess whether the code can be used successfully to complete a cyberattack. Currently, DynamoFL uses gpt-3.5-turbo as both the expansion and judge model.